OenoEDA by Brent Nixon

Data Set

In this project, I will work with the (Red) Wine Quality data set. This data set covers 1599 red wines of the Portuguese ‘Vinho Verde’ variety. There are 12 attributes to describe each wine. Of these attributes, 11 are ‘physicochemical,’ such as types of acidity, measures of certain chemicals, and density. The 12th attribute is a subjective measure of wine quality, a numeric score from 0 to 10, calculated as the median of the scores given by at least three wine experts.

Univariate Plots Section

Now, let’s dive into the data. First, I’ll run dim on the dataframe I created earlier to confirm how many observations I have and how many attributes pertain to each one.

## [1] 1599   13

1599 observations of 13 variables. I thought there were only 12 variables, so to look into what the 13th is, I will run str on the dataframe.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

I see that the extra variable is the first one, ‘X’, which appears to be the observation number, a serial ID starting at 1. Apart from X and quality, which are integers, all the variables are numeric. There aren’t any factors, but I think that quality would be a good candidate to convert, as it is essentially an ordered factor already.
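As a minimal sketch of that conversion, the snippet below rebuilds the first ten rows of three columns from the str output, drops the ID column, and adds an ordered-factor copy of quality. The column name quality.f is my own placeholder, and the levels 3 to 8 are simply the range of scores that actually occurs in this data set, not the full 0-10 scale.

```r
# First ten rows of three columns, copied from the str() output above.
wn <- data.frame(
  X       = 1:10,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Drop the serial ID column; it says nothing about the wine itself.
wn$X <- NULL

# Keep the integer score for numeric work (correlations, regression)
# and add an ordered-factor copy for grouping and plotting.
# quality.f is a hypothetical name; levels 3:8 are the scores present in the data.
wn$quality.f <- factor(wn$quality, levels = 3:8, ordered = TRUE)

str(wn)
```

Keeping both versions side by side avoids having to convert back and forth between the factor and the number later on.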

Now I’d like to run summary on the dataframe to get a statistical overview of each variable. I’ll be looking at the range of values, and for suggestions of outliers and skew in the data.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

At first glance, there appear to be several variables that are positively skewed and that possibly have outliers. To further examine this hunch, I will create boxplots of every variable to visually check for asymmetries.

From these boxplots, I can very easily see whether a distribution has outliers. I cannot as easily get a feel for how the different distributions are skewed. For that, I’ll chart the same variables, but with histograms instead.

From this group plot, I can see that the plots of some variables could be productively clustered together for a better view and easier comparison. I’ll do that now, clustering measures of acidity, sugar and chlorides, measures of sulfur, and a group with density, pH, alcohol, and quality.

These first three plots, which describe different measures of acidity, all have positive outliers. Fixed and volatile acidity have many, whereas citric acid only has one. The fixed and volatile acidity plots are slightly positively skewed and the citric acid plot is moderately positively skewed. These variables are all measured in g/dm^3.

Residual sugar and chlorides have similar plots, with narrow and symmetric bands of non-outlier data, and many positive outliers across a large range. The chlorides plot also has a few (directionally) negative outliers. Both residual sugar and chlorides are measured in g/dm^3.
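To make the outlier reading concrete, here is a small worked check using only the quartiles and extremes printed by summary. It assumes the conventional boxplot rule, where points beyond 1.5 times the interquartile range from the box count as outliers. That is the usual default, but I am assuming it applies to the plots here.

```r
# Quartiles and extremes copied from the summary() output above.
q <- data.frame(
  variable = c("residual.sugar", "chlorides"),
  min      = c(0.9, 0.012),
  q1       = c(1.9, 0.070),
  q3       = c(2.6, 0.090),
  max      = c(15.5, 0.611)
)

# Conventional 1.5 * IQR fences (an assumption about how the boxplots were drawn).
iqr <- q$q3 - q$q1
q$lower.fence <- q$q1 - 1.5 * iqr
q$upper.fence <- q$q3 + 1.5 * iqr

# A variable has outliers on a side if its extreme value lies past the fence.
q$has.low.outliers  <- q$min < q$lower.fence
q$has.high.outliers <- q$max > q$upper.fence

print(q)
```

Under that rule, both variables have maxima well past the upper fence, while only chlorides has a minimum below the lower fence, which agrees with the description of the plots.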

Following the apparent trend where measures of similar properties plot in similar ways, the two measures of sulfur dioxide show a pattern of positively skewed non-outlier data and two clusters of positive outliers. The sulphates plot has a fairly symmetrical distribution of non-outlier data, followed by three clusters of positive outliers. Free and total sulfur dioxide are measured in mg/dm^3, and sulphates is measured in g/dm^3.

I don’t believe that density and pH are fundamentally related in wine, but their plots are fairly similar, with a roughly symmetric grouping of non-outlier data and a fair number of both positive and negative outliers. Alcohol has a pretty unremarkable plot. Unlike all the other variables, quality skews somewhat positive with symmetrically placed positive and negative outliers. Density is measured in g/cm^3, pH is on its own scale (0-14), alcohol is measured in % by volume, and quality is an integer score from 0 to 10.

Those boxplots were good for looking at the number of outliers, but the skew of the data was not as easy to grasp. I think that histograms are an easier way to examine skew, so I will plot these variables with histograms.

From this group plot, you can clearly see that many of these measurements are positively skewed. In others, apart from the effect of outliers, the distributions are normal to slightly right skewed. Density and pH, on the other hand, are just nice normal distributions.

All of the variables, except pH and density, look like they could benefit from a logarithmic scale transformation. I will plot them to see if there is any apparent benefit.

After adding a log10 transformation to their x-axes, most of these plots appear to be more normally distributed, which suggests that the underlying variables are approximately lognormally distributed. A few of the variables (citric acid, free sulfur dioxide, and alcohol) did not respond as well to the transformation as the others. A logarithmic transformation works best on data whose values span several orders of magnitude. Since none of these variables really does that (alcohol, for example, only ranges from 8.4% to 14.9%), I’m not surprised that some of the transformations were not particularly effective.
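One rough way to quantify "spanning orders of magnitude" is log10(max / min). The sketch below applies that to the minimum and maximum values printed by summary for the variables that were log-transformed. The data frame and column names are my own. This is only a back-of-the-envelope check, not a formal test of lognormality.

```r
# Min and max copied from the summary() output above.
rng <- data.frame(
  variable = c("citric.acid", "residual.sugar", "chlorides",
               "free.sulfur.dioxide", "total.sulfur.dioxide",
               "sulphates", "alcohol"),
  min = c(0, 0.9, 0.012, 1, 6, 0.33, 8.4),
  max = c(1, 15.5, 0.611, 72, 289, 2, 14.9)
)

# Orders of magnitude spanned by each variable.
# A minimum of 0 gives Inf here, and log10(0) itself is -Inf,
# so zero values cannot be shown on a log axis at all.
rng$orders <- log10(rng$max / rng$min)

# Sort from the narrowest to the widest relative range.
rng <- rng[order(rng$orders), ]
print(rng, row.names = FALSE)
```

Alcohol covers only a fraction of one order of magnitude, and citric acid contains zeros, which may be part of why those two plots changed so little or behaved oddly under the transformation.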

Later in this project, I decided to group the quality data into three categories (bad, good, and excellent) and then break down a few plots by those categories. I realized it could also be helpful for understanding the distribution of wines by quality in this section, so I added a wine quality histogram doing just that.

The figure above shows the histograms of quality and quality bucket, respectively. The quality buckets are 1-4, 5-6, and 7-10, representing bad, good, and excellent quality. Armed with that information and comparing the two charts side by side, it makes sense that the good bucket would be dramatically larger than the bad or excellent bucket. The histogram of the individual scores shows that wines with a score of 5 or 6 make up most of the sample. Additionally, the excellent bucket gets a bump from the decent number of wines with a score of 7.
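The bucketing itself can be sketched with cut, using the bucket boundaries stated above. The quality vector here is just the first ten scores from the str output, so the counts are illustrative only.

```r
# First ten quality scores, copied from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Intervals are right-closed by default: (0,4], (4,6], (6,10],
# which corresponds to scores 1-4, 5-6, and 7-10.
bucket <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("bad", "good", "excellent"),
  ordered_result = TRUE
)

table(bucket)
```

Because the result is an ordered factor, the buckets keep their bad, good, excellent order when used for faceting or grouping.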

Univariate Analysis

Structure of the Dataset

The dataset is composed of 1599 records, each described by the 12 variables:

  • “fixed.acidity”
  • “volatile.acidity”
  • “citric.acid”
  • “residual.sugar”
  • “chlorides”
  • “free.sulfur.dioxide”
  • “total.sulfur.dioxide”
  • “density”
  • “pH”
  • “sulphates”
  • “alcohol”
  • “quality”

All the variables are numeric, except quality, which has integers and could probably be converted to a factor.

All but three variables (pH, density, and quality) display at least some positive skew. This means that the majority of wines cluster within a certain range, but some wines have very high measures.

Almost all the wines score either a 5 or 6 on the quality scale, with a few 7s, and the occasional 3-4 or 8. This means that most wines are pretty good, but a few are amazing and a few are pretty darn bad.

Main feature(s) of interest

The main feature of interest in the dataset is the measure of wine quality. Even if one is a wine expert, individual measures, like free sulfur dioxide, are going to be hard to interpret. Using such a value to explain a pattern in quality, however, seems much more likely. As a non-wine expert, I would guess that alcohol and sugar content could be two variables that are more easily interpretable or predictive of quality than others.

Potentially helpful features

I think that combining all the variables into a regression model will end up providing the most signal from them.

New variables

I did not create any new variables. I will, however, convert the quality scale into an ordered factor.

Data adjustments

I adjusted the scale on the plots of a few of the variables, residual.sugar, chlorides, total.sulfur.dioxide, and sulphates, using a logarithmic transformation. Doing so improved all of them, suggesting lognormal distributions.

Bivariate Plots Section

Since I am trying to uncover patterns in the data and relationships between variables, starting with measures of correlation between variables can give me a head start on that process. Below, I will plot a correlation matrix which shows the correlation coefficients between all the variables.
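For reference, a correlation matrix of this kind can be computed with cor and then filtered for the stronger pairs. The sketch below uses only the first ten rows of five columns from the str output, so the coefficients it prints will not match the full-data values. The 0.3 cutoff is an arbitrary threshold I picked for illustration.

```r
# First ten rows of five columns, copied from the str() output.
wn <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns.
cm <- cor(wn)
print(round(cm, 2))

# List each pair once (upper triangle) whose |r| meets a hypothetical cutoff.
idx <- which(abs(cm) >= 0.3 & upper.tri(cm), arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = round(cm[idx], 2)
)
print(strong)
```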

The correlation matrix shows that there are a few variables with high and medium correlations, and many with low correlations. Some of the high correlations don’t really carry much signal, as they are correlations between very similar variables, like citric acid and volatile acidity, or total sulfur dioxide and free sulfur dioxide. These high correlations are similar to a high correlation between ticket sales and total attendance at a sporting event: the two are too closely connected to tell us anything new.

The variables with medium to high correlations that are reasonably independent of each other are:

Both positive and negative correlations are interesting. Some of these correlations seem explainable by simple physics, but others could yield some unexpected patterns. For example, as sugar increases, density goes up. This is simple: when you dissolve more solid into a liquid, the density increases. As alcohol increases, density decreases. Since alcohol is less dense than water, this makes sense. The less obvious relationships are sulphates to chlorides, alcohol to quality, and volatile acidity to quality. It will be useful to explore those relationships in greater detail.

Additionally, I’m curious about dividing variables between things that make a wine good, and things that make a wine bad. In other words, we know that high acetic acid and high total sulfur dioxide probably make a wine bad, because it tastes like vinegar or smells like rotten eggs. On the other hand, there are wine-making factors like sulphates* added, sugar in the wine, and how acidic the wine is allowed to get. I’m pretty sure that these factors can be controlled by the winemaker, and therefore could correlate to striking successes.

*Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

*Volatile acidity: the amount of acetic acid in wine, which at levels that are too high can lead to an unpleasant, vinegar taste.

*Total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

## 
##  Pearson's product-moment correlation
## 
## data:  wn$alcohol and wn$residual.sugar
## t = 1.6829, df = 1597, p-value = 0.09258
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.006960058  0.090909069
## sample estimates:
##        cor 
## 0.04207544

Plotting alcohol vs residual sugar shows that most wines have between roughly 1 and 3 g/L of residual sugar, and that those wines range from around 9 to 14% alcohol. As alcohol increases, the highest values of residual sugar decrease, from around 16 g/L to about 6 g/L. This makes sense, as wines with higher alcohol have converted more of the fuel, sugar, into alcohol. This weak trend is only visible in a handful of points. As the correlation test shows (r = 0.04, p = 0.09), there is no real relationship between alcohol and residual sugar.

## 
##  Pearson's product-moment correlation
## 
## data:  wn$alcohol and wn$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

This plot suggests a moderate relationship between alcohol and quality. In the highest and lowest ranges of quality, there is a weak trend (created by few data points) of higher alcohol associating with higher quality, and vice versa. A stronger trend is that ‘good’ wines (with scores from 5-7) span a broad range of alcohol content, between 9 and 11%, while also demonstrating the same trend of higher alcohol correlating with higher quality. The correlation test yields a 0.48 correlation between alcohol and quality. The blue line is a linear model fit to the data and demonstrates the positive trend.
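The t statistic and confidence interval in the output above follow directly from the correlation and the sample size. As a sanity check, the sketch below recomputes them from the reported r and df, using the standard t formula for a Pearson correlation and the Fisher z-transformation for the interval. The variable names are my own.

```r
r  <- 0.4761663   # sample estimate reported by the test above
df <- 1597        # degrees of freedom reported above (n - 2)
n  <- df + 2

# t statistic for testing whether the true correlation is zero.
t.stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation:
# atanh(r) is approximately normal with standard error 1 / sqrt(n - 3).
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(t.stat)
print(ci)
```

With 1599 observations the standard error is small, which is why even the weak correlations elsewhere in this section can come out statistically significant.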

## 
##  Pearson's product-moment correlation
## 
## data:  wn$residual.sugar and wn$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164

Plotting residual sugar vs quality shows that most wines have a residual sugar content between 1 and 3 g/L. The better wines (score of 4-8) have more outliers with sugar content out to 15 g/L, but they almost entirely cluster between 1 and 3 g/L. As the correlation test shows, there is almost zero relationship between residual sugar and quality.

## 
##  Pearson's product-moment correlation
## 
## data:  wn$pH and wn$quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

As shown by the correlation test, there is essentially no relationship between pH and quality: the correlation of -0.06 is statistically significant (p = 0.02) but far too small to be useful. Most wines fall between a pH of 3.1 and 3.6, and the average-quality wines span a broader range of pH than the better and worse wines.

## 
##  Pearson's product-moment correlation
## 
## data:  wn$sulphates and wn$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

As the correlation test shows, there is only a weak positive relationship (0.25) between sulphate content and quality. The blue line represents the linear model fit to the data, and shows the trend of this relationship. The average-score wines have a wider range of sulphate content.

I was hoping to find some interesting relationships between sugar, pH, sulphates, alcohol and quality, but except for alcohol, and to some extent, sulphates, there weren’t any notable patterns. Since the first set of positive drivers didn’t turn up much, I can’t help but want to check three likely negative drivers: total sulfur dioxide, volatile acidity, and chlorides.

## 
##  Pearson's product-moment correlation
## 
## data:  wn$total.sulfur.dioxide and wn$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

Very little to see here. There is a small negative correlation between total sulfur dioxide and quality. As wines get lower in quality, there is some trend of them having higher total sulfur dioxide measures.

## 
##  Pearson's product-moment correlation
## 
## data:  wn$volatile.acidity and wn$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

There is a moderate negative correlation (-0.39) between volatile acidity and quality. As wines get lower in quality, they tend to have higher volatile acidity. The blue linear regression line shows this negative trend.

## 
##  Pearson's product-moment correlation
## 
## data:  wn$chlorides and wn$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

There is only a very weak negative relationship (-0.13) between chlorides and quality. There are a handful of wines with average scores that have a wide range of chloride content.

Bivariate Analysis

From the correlation tests and bivariate plots, only a few weak to moderate relationships were uncovered. The strongest positive correlation was 0.48 between alcohol and quality. The only other notable positive correlation was 0.25 between sulphates and quality. The strongest negative correlation was -0.39 between volatile acidity and quality. The next was -0.19 between total sulfur dioxide and quality. The last notable negative correlation was -0.13 between chlorides and quality.

I don’t understand why higher alcohol content would be related to higher quality, but there was a medium correlation between the two. The small positive correlation between sulphates and quality makes sense because sulphates are added to help control fermentation processes and prevent the wine from going bad.

The negative correlations of volatile acidity, total sulfur dioxide, and chlorides with quality also make sense, because wines that taste like vinegar, smell like rotten eggs, or are salty probably aren’t very high quality. The descending order of correlation strength between those three variables and quality could be owing to how easy each attribute is to discern. Maybe acetic acid is easier to detect than total sulfur dioxide, which is easier to detect than chloride content.

Multivariate Plots Section

There are a few plots I would like to explore with more than two variables. For starters, I would like to group the quality data into three categories (bad, good, and excellent) and then break down a few plots by those categories.

Here are the three plots I will explore:

The above chart shows residual sugar plotted against alcohol, faceted by quality bracket, with the overall median alcohol content shown by the vertical lines. The pattern I see is that a majority of both the bad and good wines have an alcohol content less than the overall median alcohol content. The excellent wines, on the other hand, almost all have an alcohol content somewhat higher than the median. Except for the good wines, alcohol does not seem to have a relationship with residual sugar. The slight relationship visible with good wines could simply be owing to there being many more data points for that quality bracket.
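The comparison against the median can also be made numerically. As a rough sketch, the snippet below computes the median alcohol content per quality bracket and overall, using only the first ten rows from the str output. Those rows happen to contain no bad wines, so that bracket comes back as NA, and the numbers say nothing about the full data set.

```r
# First ten rows of alcohol and quality, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Same brackets as used for the facets: 1-4, 5-6, 7-10.
bracket <- cut(quality, breaks = c(0, 4, 6, 10),
               labels = c("bad", "good", "excellent"))

# Median alcohol within each bracket; empty brackets yield NA.
by.bracket <- tapply(alcohol, bracket, median)
overall    <- median(alcohol)

print(by.bracket)
print(overall)

# Positive values mean the bracket sits above the overall median.
print(by.bracket - overall)
```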

The relationship we see in this plot is that bad wines have a sulphate content slightly below the median value (shown again by vertical lines), while good wines are also below the median, but closer to it than the bad wines. Lastly, excellent wines have an above-median sulphate content.

From this last plot, we can see that the bad wines generally have a pH above the pH median (shown by vertical lines). Also, more of them have higher volatile acidity measurements than the excellent wines. In other words, they have low acidity and are often slightly vinegary (mmm…). Good wines usually straddle the pH median, but are lower in volatile acidity than bad wines. Excellent wines, however, are a bit lower in acidity than the others as well as having the lowest amount of volatile acidity among the three quality brackets.

I couldn’t help but try to fit the data to a linear model. I made eleven versions of the model, starting with one variable, and updating each subsequent model with an additional variable. By the end, the R-squared value was 0.361, which isn’t the best: the full model explains only about 36% of the variance in quality.

## 
## Calls:
## wnm1: lm(formula = quality ~ fixed.acidity, data = wn)
## wnm2: lm(formula = quality ~ fixed.acidity + volatile.acidity, data = wn)
## wnm3: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid, 
##     data = wn)
## wnm4: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar, data = wn)
## wnm5: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides, data = wn)
## 
## ==========================================================================================
##                         wnm1          wnm2          wnm3          wnm4          wnm5      
## ------------------------------------------------------------------------------------------
##   (Intercept)           5.157***      6.451***      6.450***      6.439***      6.541***  
##                        (0.098)       (0.121)       (0.121)       (0.123)       (0.124)    
##   fixed.acidity         0.058***      0.012         0.014         0.014         0.006     
##                        (0.012)       (0.011)       (0.015)       (0.015)       (0.015)    
##   volatile.acidity                   -1.732***     -1.746***     -1.752***     -1.608***  
##                                      (0.107)       (0.127)       (0.128)       (0.130)    
##   citric.acid                                      -0.032        -0.042         0.174     
##                                                    (0.152)       (0.153)       (0.159)    
##   residual.sugar                                                  0.007         0.008     
##                                                                  (0.013)       (0.013)    
##   chlorides                                                                    -2.019***  
##                                                                                (0.413)    
## ------------------------------------------------------------------------------------------
##   R-squared             0.015         0.153         0.153         0.153         0.166     
##   adj. R-squared        0.015         0.152         0.152         0.151         0.163     
##   sigma                 0.802         0.744         0.744         0.744         0.739     
##   F                    24.960       144.319        96.170        72.166        63.351     
##   p                     0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -1914.235     -1793.729     -1793.707     -1793.564     -1781.641     
##   Deviance           1026.127       882.555       882.530       882.372       869.310     
##   AIC                3834.471      3595.459      3597.415      3599.128      3577.281     
##   BIC                3850.602      3616.968      3624.300      3631.391      3614.921     
##   N                  1599          1599          1599          1599          1599         
## ==========================================================================================
## 
## Calls:
## wnm6: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide, data = wn)
## wnm7: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide, 
##     data = wn)
## wnm8: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density, data = wn)
## wnm9: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH, data = wn)
## wnm10: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates, data = wn)
## wnm11: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = wn)
## 
## ============================================================================================================
##                             wnm6          wnm7          wnm8          wnm9         wnm10         wnm11      
## ------------------------------------------------------------------------------------------------------------
##   (Intercept)               6.641***      6.663***    166.584***    182.592***    189.679***     21.965     
##                            (0.131)       (0.129)      (14.259)      (14.795)      (14.266)      (21.195)    
##   fixed.acidity             0.001        -0.016         0.119***      0.174***      0.172***      0.025     
##                            (0.015)       (0.015)       (0.019)       (0.023)       (0.023)       (0.026)    
##   volatile.acidity         -1.623***     -1.404***     -1.219***     -1.256***     -0.984***     -1.084***  
##                            (0.130)       (0.131)       (0.127)       (0.127)       (0.125)       (0.121)    
##   citric.acid               0.176         0.464**       0.202         0.170         0.047        -0.183     
##                            (0.158)       (0.160)       (0.156)       (0.156)       (0.150)       (0.147)    
##   residual.sugar            0.014         0.022         0.080***      0.085***      0.095***      0.016     
##                            (0.014)       (0.013)       (0.014)       (0.014)       (0.013)       (0.015)    
##   chlorides                -2.005***     -2.071***     -1.195**      -0.645        -2.278***     -1.874***  
##                            (0.412)       (0.405)       (0.398)       (0.421)       (0.431)       (0.419)    
##   free.sulfur.dioxide      -0.004*        0.008**       0.006**       0.005*        0.004         0.004*    
##                            (0.002)       (0.002)       (0.002)       (0.002)       (0.002)       (0.002)    
##   total.sulfur.dioxide                   -0.006***     -0.005***     -0.004***     -0.004***     -0.003***  
##                                          (0.001)       (0.001)       (0.001)       (0.001)       (0.001)    
##   density                                            -161.857***   -180.667***   -188.401***    -17.881     
##                                                       (14.431)      (15.178)      (14.638)      (21.633)    
##   pH                                                                  0.675***      0.625***     -0.414*    
##                                                                      (0.175)       (0.169)       (0.192)    
##   sulphates                                                                         1.261***      0.916***  
##                                                                                    (0.113)       (0.114)    
##   alcohol                                                                                         0.276***  
##                                                                                                  (0.026)    
## ------------------------------------------------------------------------------------------------------------
##   R-squared                 0.169         0.198         0.256         0.263         0.317         0.361     
##   adj. R-squared            0.166         0.194         0.253         0.259         0.312         0.356     
##   sigma                     0.738         0.725         0.698         0.695         0.670         0.648     
##   F                        53.851        55.968        68.538        63.097        73.611        81.348     
##   p                         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -1778.901     -1750.636     -1689.759     -1682.345     -1622.136     -1569.138     
##   Deviance                866.336       836.243       774.933       767.779       712.083       666.411     
##   AIC                    3573.801      3519.271      3399.519      3386.689      3268.271      3164.277     
##   BIC                    3616.818      3567.666      3453.290      3445.838      3332.797      3234.179     
##   N                      1599          1599          1599          1599          1599          1599         
## ============================================================================================================
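A chain of nested models like this can be built compactly with update, which adds one term to the previous model's formula. The sketch below shows the first three steps on the first ten rows from the str output. I am not claiming this is how the models above were constructed, and with so few rows the coefficients are meaningless. The point is only the mechanics.

```r
# First ten rows of four columns, copied from the str() output.
wn <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Start with one predictor, then add one more at each step.
wnm1 <- lm(quality ~ fixed.acidity, data = wn)
wnm2 <- update(wnm1, ~ . + volatile.acidity)
wnm3 <- update(wnm2, ~ . + citric.acid)

# R-squared can never decrease when a predictor is added to a nested
# least-squares model, which is why adjusted R-squared, AIC, and BIC
# are worth reading alongside it.
models <- list(wnm1 = wnm1, wnm2 = wnm2, wnm3 = wnm3)
r2     <- sapply(models, function(m) summary(m)$r.squared)
adj.r2 <- sapply(models, function(m) summary(m)$adj.r.squared)

print(round(rbind(r2, adj.r2), 3))
```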

After building the model, I used the predict function to create a predicted quality score for each wine in the data set. Since the predict function yielded predictions with decimal values but the actual quality scores are integers, I used the round function to convert the predictions into integers.

To check the accuracy of these rounded predictions, I plotted the difference between each wine’s quality score and the model prediction. As you can see below, about 60% of the predictions were correct, which is not terrible. It appears that the model predicted a bit high more than it did low. Some further areas to explore with this model would be refitting it with log-transformed data or 95th percentile data.
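The predict, round, and compare steps can be sketched as follows. The model here uses just two predictors fitted to the first ten rows from the str output, purely to show the mechanics. The names fit, pred, and err are placeholders of mine, and the accuracy it prints has no bearing on the 60% figure above.

```r
# First ten rows of three columns, copied from the str() output.
wn <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# A deliberately small stand-in model, not the eleven-variable one.
fit <- lm(quality ~ volatile.acidity + alcohol, data = wn)

# Continuous predictions, rounded to the nearest integer score.
pred <- round(predict(fit, newdata = wn))

# Error convention: actual minus predicted, so a negative value
# means the model predicted too high.
err <- wn$quality - pred

# Share of wines whose rounded prediction matches the actual score.
accuracy <- mean(err == 0)

print(table(err))
print(accuracy)
```

Note that this scores the model on the same rows it was fitted to, so the accuracy is likely optimistic compared with what held-out wines would give.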

Multivariate Analysis

In this section, I looked at the relationships between alcohol and residual sugar, sulphates and total sulfur dioxide, and pH and volatile acidity. I cut the quality variable into three brackets (bad, good, and excellent) and broke the three plots down by these brackets.

The goal of adding this third variable was to use the quality information to infer some pattern about how my assumed independent variables affected the dependent ones like total sulfur dioxide, volatile acidity and residual sugar content in a way that drove a difference in quality.

My overall summary is that the best wines are not sweet and range from 10-12% alcohol. Bad wines are lower in sulphates and good wines are higher in sulphates, but total sulfur dioxide does not seem to be related to that sulphate content. Bad wines have higher volatile acidity and lower total acidity (that is, higher pH). The best wines have lower volatile acidity and a bit lower total acidity than most.

The linear model I built had an R-squared of 0.361 and about 60% accuracy in predicting wine quality. Since there are so many more good wines than bad or excellent wines, I wouldn’t be surprised if the prediction accuracy is worse for bad and excellent wines, because there were fewer data points in those ranges. Possible options for improving the model include using log-transformed data and data trimmed to the 95th percentiles.


Final Plots and Summary

Plot One

Description One

I picked this plot because it shows one of the strongest (and most interesting) negative correlations in the dataset. As shown on the graph, there is a moderate correlation of -0.39 between volatile acidity and wine quality. Volatile acidity is essentially the amount of acetic acid, the primary component in vinegar, in wine. When volatile acidity gets too high, the wine takes on an unpleasant, vinegary taste. This explains the trend shown by the grey line, which represents a linear model fit to the data: as volatile acidity increases, quality decreases.
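The link between the correlation and the fitted line can be sketched as follows. The data are randomly generated with a built-in negative relationship (the coefficients are arbitrary), so the resulting correlation will not equal the -0.39 reported for the real wines.

```r
set.seed(7)
# Made-up data with a built-in negative relationship, for illustration only.
volatile.acidity <- runif(300, 0.1, 1.2)
quality <- round(6.5 - 1.8 * volatile.acidity + rnorm(300, sd = 0.8))

r     <- cor(volatile.acidity, quality)        # Pearson correlation
fit   <- lm(quality ~ volatile.acidity)        # straight-line fit
slope <- unname(coef(fit)["volatile.acidity"])

# For a one-predictor fit, slope = r * sd(y) / sd(x), so both share a sign.
all.equal(slope, r * sd(quality) / sd(volatile.acidity))
```

Because the slope is the correlation rescaled by the two standard deviations, a negative correlation always produces a downward-sloping fitted line.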

Plot Two

Description Two

I chose this plot, of residual sugar against alcohol, faceted by quality bucket, because it is one of the few plots where there was a noticeable trend. The three panels show how alcohol and sugar content relate to each other for bad, good, and excellent wines. Superimposed on each panel are vertical lines representing the median alcohol content for wines in each particular quality bucket, as well as overall.

For bad wines, we see that the median alcohol content is below the median for all wines. There is only one wine with a residual sugar content above ~8 g/dm^3, which is probably reflective of the low number of samples for this quality bucket.

Good wines are similar to bad wines, with below-median alcohol content, although there is a weak negative linear trend between alcohol content and residual sugar that was not apparent in bad wines.

Excellent wines have an above-median alcohol content, with a spread noticeably greater than that seen in the lower two quality buckets. The lower two quality buckets also contain a few wines with high residual sugar (> ~8 g/dm^3), whereas no excellent wine exceeds that level.
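The medians behind the vertical lines can be computed along these lines. The alcohol values and the column name quality.bucket are made up for illustration; they were chosen only to mirror the direction of the pattern described above.

```r
# Made-up alcohol values and quality buckets, for illustration only.
wines <- data.frame(
  alcohol = c(9.2, 9.8, 10.1, 9.5, 10.0, 9.9, 11.4, 12.0, 11.8),
  quality.bucket = factor(rep(c("bad", "good", "excellent"), each = 3),
                          levels = c("bad", "good", "excellent"))
)

overall.median <- median(wines$alcohol)
bucket.medians <- tapply(wines$alcohol, wines$quality.bucket, median)

# Negative values sit below the overall median, positive values above it.
bucket.medians - overall.median
```

Each bucket median would be drawn in its own panel, with the overall median repeated across all three as a common reference.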

Plot Three

Description Three

I like this plot because it shows a slight pattern that I was curious about earlier in the project, involving sulphate content, total sulfur dioxide, and quality. This plot demonstrates that pattern by plotting total sulfur dioxide against sulphates, and separating by quality bucket.

In the plot above, you can see that the bad wines have a slightly above-median sulphate content. The good wines have a sulphate content close to the overall median. Excellent wines have a moderately below-median sulphate content.

As for the relationship between sulphate content and total sulfur dioxide levels, the good wines appear to have more high values. I suspect that this is simply because there are far more good wines than bad or excellent ones.

In summary, the most noticeable pattern is the below-median sulphate content of excellent wines.


Reflection

Looking back on this project, I got to break out twelve different attributes of 1599 red wines. Another way to look at this data set is 11 features that point towards a target variable, the quality score. I gained intuition about the data set by plotting all the variables with boxplots and histograms, and by running statistical summaries of the variables.

From then on, I compared multiple variables to each other, looking for patterns and relationships that would give me insight into how the physicochemical attributes of these wines affect each other and contribute to quality. I did this by making scatter plots, faceting by quality, and superimposing measures of central tendency, as well as by calculating correlations between the same sets of variables to check for and quantify any such relationships.

There were some moderate relationships between variables and quality. The most notable were between alcohol and quality, with a positive correlation of 0.48, between volatile acidity and quality, with a negative correlation of -0.39, and between sulphates and quality, with a positive correlation of 0.25. Building a linear regression model with all the features gave an R-squared value of 0.361, which was underwhelming.
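Ranking the features by their correlation with quality, and reading off the model's R-squared, can be sketched as follows. The data frame is randomly generated with arbitrary coefficients, so the printed values are not the 0.48, -0.39, 0.25, and 0.361 reported above.

```r
set.seed(3)
n <- 400
# Made-up features; the coefficients below are arbitrary, not estimates.
wines <- data.frame(
  alcohol          = rnorm(n),
  volatile.acidity = rnorm(n),
  sulphates        = rnorm(n),
  pH               = rnorm(n)
)
wines$quality <- 0.5 * wines$alcohol - 0.4 * wines$volatile.acidity +
  0.25 * wines$sulphates + rnorm(n)

# Correlation of every feature with quality, strongest first.
features <- setdiff(names(wines), "quality")
cors <- sapply(wines[features], cor, y = wines$quality)
cors[order(abs(cors), decreasing = TRUE)]

# R-squared of a linear model that uses all the features.
r2 <- summary(lm(quality ~ ., data = wines))$r.squared
r2
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, which makes it easier to pick out the features worth plotting.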

One big struggle with this data is that there are many more good wines than there are bad or excellent ones. In the buckets with fewer samples, it seems likely that a representative distribution of values for each variable is not seen, which makes it harder to compare with good wines and to draw conclusions.

One way to make this data set a lot more interesting would be to add in location, winemaker, varietal, and pricing information. Getting more samples in the bad and excellent quality ranges would also really improve the data. Lastly, it would be interesting to run this data through some machine learning algorithms to build a prediction engine better than the linear regression model that I made.